52 research outputs found
Machine learning in compilers
Tuning a compiler so that it produces optimised code is a difficult task because modern processors
are complicated; they have a large number of components operating in parallel and each
is sensitive to the behaviour of the others. Building analytical models on which optimisation
heuristics can be based has become harder as processor complexity increased and this trend is
bound to continue as the world moves towards further heterogeneous parallelism. Compiler
writers need to spend months to get a heuristic right for any particular architecture and these
days compilers often support a wide range of disparate devices. Whenever a new processor
comes out, even if derived from a previous one, the compiler's heuristics will need to be retuned
for it. This is typically too much effort, and so in practice most compilers are out of date.
Machine learning has been shown to help; by running example programs, compiled in
different ways, and observing how those ways affect program run-time, automatic machine
learning tools can predict good settings with which to compile new, as yet unseen programs.
The field is nascent, but has demonstrated significant results already and promises a day when
compilers will be tuned for new hardware without the need for months of compiler experts'
time. Many hurdles still remain, however, and while experts no longer have to worry about
the details of heuristic parameters, they must spend their time on the details of the machine
learning process instead to get the full benefits of the approach.
This thesis aims to remove some of the aspects of machine learning based compilers for
which human experts are still required, paving the way for a completely automatic, retuning
compiler.
First, we tackle the most conspicuous area of human involvement: feature generation. In all
previous machine learning work for compilers, the features, which describe the important aspects
of each example to the machine learning tools, must be constructed by an expert. Should
that expert choose features poorly, they will miss crucial information without which the machine
learning algorithm can never excel. We show that not only can we automatically derive
good features, but that these features outperform those of human experts. We demonstrate our
approach on loop unrolling, and find we do better than previous work, obtaining XXX% of the
available performance, more than the XXX% of previous state of the art.
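The prediction step at the heart of such a model can be sketched with a toy nearest-neighbour predictor. The feature vectors and unroll factors below are invented for illustration; the thesis's point is that the features themselves are derived automatically rather than hand-picked:

```python
def predict_unroll(features, training):
    # training: (feature_vector, best_unroll_factor) pairs gathered by
    # compiling example loops in different ways and timing them.
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    # Predict the unroll factor of the most similar known example.
    return min(training, key=lambda ex: dist(ex[0], features))[1]

# Hypothetical features: (trip count, body size, has_branch)
train = [((4, 1, 0), 8), ((16, 2, 0), 4), ((64, 3, 1), 2)]
factor = predict_unroll((6, 1, 0), train)  # closest to (4, 1, 0)
```

Real systems replace the nearest-neighbour lookup with a stronger learner, but the shape of the task, mapping program features to a heuristic setting, is the same.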
Next, we demonstrate a new method to efficiently capture the raw data needed for machine
learning tasks. The iterative compilation on which machine learning in compilers depends is
typically time consuming, often requiring months of compute time. The underlying processes
are also noisy, so that most prior works fall into two categories: those which attempt to gather
clean data by executing a large number of times and those which ignore the statistical validity
of their data to keep experiment times feasible. Our approach, on the other hand, guarantees
clean data while adapting to the experiment at hand, needing an order of magnitude less work
than prior techniques.
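The adaptive data-capture idea can be illustrated by a simple sequential sampling loop that keeps re-running a noisy experiment until the mean runtime is statistically tight. The fixed z-value and tolerances below are illustrative choices, not the thesis's actual procedure:

```python
import random
import statistics

def measure(run, rel_tol=0.02, min_runs=5, max_runs=100):
    # Adaptively sample a noisy experiment: stop once the approximate 95%
    # confidence interval of the mean is within rel_tol of the mean,
    # or after max_runs executions.
    samples = [run() for _ in range(min_runs)]
    while len(samples) < max_runs:
        mean = statistics.mean(samples)
        half_width = 1.96 * statistics.stdev(samples) / len(samples) ** 0.5
        if half_width <= rel_tol * mean:
            break
        samples.append(run())
    return statistics.mean(samples), len(samples)

random.seed(0)
noisy_runtime = lambda: random.gauss(100.0, 2.0)  # simulated runtime, ms
mean, n = measure(noisy_runtime)
```

A stable experiment stops after a handful of runs, while a noisy one automatically gets more; this is the adaptivity that lets clean data be gathered without a fixed, worst-case number of repetitions.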
COLAB:A Collaborative Multi-factor Scheduler for Asymmetric Multicore Processors
Funding: Partially funded by the UK EPSRC grants Discovery: Pattern Discovery and Program Shaping for Many-core Systems (EP/P020631/1) and ABC: Adaptive Brokerage for Cloud (EP/R010528/1); Royal Academy of Engineering under the Research Fellowship scheme.
Increasingly prevalent asymmetric multicore processors (AMPs) are necessary for delivering performance in the era of limited power budgets and dark silicon. However, software fails to use them efficiently. OS schedulers, in particular, handle asymmetry only under restricted scenarios. We have efficient symmetric schedulers, efficient asymmetric schedulers for single-threaded workloads, and efficient asymmetric schedulers for single-program workloads. What we do not have is a scheduler that can handle all runtime factors affecting AMPs for multi-threaded, multi-programmed workloads. This paper introduces the first general-purpose asymmetry-aware scheduler for multi-threaded, multi-programmed workloads. It estimates the performance of each thread on each type of core and identifies communication patterns and bottleneck threads. The scheduler then makes coordinated core-assignment and thread-selection decisions that still give each application its fair share of the processor's time. We evaluate our approach using the GEM5 simulator on four distinct big.LITTLE configurations and 26 mixed workloads composed of PARSEC and SPLASH-2 benchmarks. Compared with the state-of-the-art Linux CFS and AMP-aware schedulers, we demonstrate performance gains of up to 25%, and 5% to 15% on average depending on the hardware setup.
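The core-assignment part of such a scheduler can be sketched as a greedy placement that gives big cores to the threads estimated to benefit most from them. This is a deliberately simplified stand-in; the real COLAB scheduler also weighs bottleneck threads, communication patterns, and fairness over time:

```python
def assign(threads, n_big, n_little):
    # threads: {thread_name: estimated big-vs-LITTLE speedup}.
    # Greedily place the threads with the largest estimated speedup on
    # big cores and the rest on LITTLE cores (a toy model of coordinated
    # core assignment, not COLAB's full multi-factor policy).
    ranked = sorted(threads, key=threads.get, reverse=True)
    return {"big": ranked[:n_big],
            "little": ranked[n_big:n_big + n_little]}

# Hypothetical per-thread speedup estimates on a 1-big / 2-LITTLE setup.
placement = assign({"a": 2.1, "b": 1.2, "c": 1.8}, n_big=1, n_little=2)
```

Even this toy version shows why per-thread performance estimates matter: without them, every placement of threads onto asymmetric cores is a guess.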
Parallel-Pattern Aware Compiler Optimisations: Challenges and Opportunities
This report outlines our finding that existing compilers are not aware of parallel-pattern semantics and thus miss massive optimisation opportunities.
Iterative Compilation on Mobile Devices
The abundance of poorly optimized mobile applications coupled with their
increasing centrality in our digital lives make a framework for mobile app
optimization an imperative. While tuning strategies for desktop and server
applications have a long history, it is difficult to adapt them for use on
mobile phones.
Reference inputs which trigger behavior similar to a mobile application's
typical use are hard to construct. For many classes of applications the very
concept of typical behavior is nonexistent, with each user interacting with the
application in very different ways. In contexts like this, optimization
strategies need to evaluate their effectiveness against real user input, but
doing so online runs the risk of user dissatisfaction when suboptimal
optimizations are evaluated.
In this paper we present an iterative compiler which employs a novel capture
and replay technique in order to collect real user input and use it later to
evaluate different transformations offline. The proposed mechanism identifies
and stores only the set of memory pages needed to replay the most heavily used
functions of the application. At idle periods, this minimal state is combined
with different binaries of the application, each one built with different
optimizations enabled. Replaying the targeted functions allows us to evaluate
the effectiveness of each set of optimizations for the actual way the user
interacts with the application.
For the BEEBS benchmark suite, our approach was able to improve performance
by up to 57%, while keeping the slowdown experienced by the user on average at
0.8%. By focusing only on heavily used functions, we are able to conserve
storage space by between two and three orders of magnitude compared to typical
capture and replay implementations.
Comment: 8 pages, 8 figures
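The offline evaluation step described above can be sketched as replaying a captured input against each candidate build during an idle period and keeping the fastest. Here the "variants" are plain callables and the names are illustrative, standing in for differently optimized binaries of the same function:

```python
import time

def evaluate_offline(snapshot, variants, repeats=3):
    # Replay a captured input ("snapshot") against each candidate build
    # ("variant": a callable here, a recompiled binary in the real system)
    # and return the name of the fastest one.
    best, best_t = None, float("inf")
    for name, fn in variants.items():
        t0 = time.perf_counter()
        for _ in range(repeats):
            fn(snapshot)
        elapsed = (time.perf_counter() - t0) / repeats
        if elapsed < best_t:
            best, best_t = name, elapsed
    return best

# Two hypothetical builds of the same hot function; "fast" does far less work.
variants = {"slow": lambda s: sum(range(200000)),
            "fast": lambda s: sum(range(1000))}
best = evaluate_offline(None, variants)
```

Because the replay is driven by real captured state rather than a synthetic benchmark input, the winner reflects how the user actually exercises the function.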
Code Translation with Compiler Representations
In this paper, we leverage low-level compiler intermediate representations
(IR) to improve code translation. Traditional transpilers rely on syntactic
information and handcrafted rules, which limits their applicability and
produces unnatural-looking code. Applying neural machine translation (NMT)
approaches to code has successfully broadened the set of programs on which one
can get a natural-looking translation. However, they treat the code as
sequences of text tokens, and still do not differentiate well enough between
similar pieces of code which have different semantics in different languages.
The consequence is low quality translation, reducing the practicality of NMT,
and stressing the need for an approach significantly increasing its accuracy.
Here we propose to augment code translation with IRs, specifically LLVM IR,
with results on the C++, Java, Rust, and Go languages. Our method improves upon
the state of the art for unsupervised code translation, increasing the number
of correct translations by 11% on average, and up to 79% for the Java -> Rust
pair with greedy decoding. With beam search, it increases the number of correct
translations by 5.5% on average. We extend previous test sets for code
translation, by adding hundreds of Go and Rust functions. Additionally, we
train models with high performance on the problem of IR decompilation,
generating programming source code from IR, and study using IRs as intermediary
pivot for translation.
Comment: 9 pages
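One way to augment an NMT input with IR is simply to concatenate the source function and its compiler-generated IR behind separator tokens before feeding the pair to the model. The token names below are illustrative, not the paper's actual vocabulary:

```python
def make_example(src_code, src_ir, tgt_code):
    # Build one training example: the model input pairs the source-language
    # function with its LLVM IR, separated by (hypothetical) special tokens,
    # and the target is the same function in the target language.
    model_input = f"<src> {src_code} <ir> {src_ir}"
    return model_input, tgt_code

inp, tgt = make_example("int one() { return 1; }",
                        "ret i32 1",
                        "fn one() -> i32 { 1 }")
```

The intuition is that two syntactically similar functions with different semantics compile to visibly different IR, giving the model a signal that token sequences alone do not carry.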
- …